Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media
Authors
Abstract
Code-mixing or code-switching refers to the phenomenon of effortless and natural switching between two or more languages in a single conversation, sometimes even in a single utterance, by multilingual speakers. However, using a foreign word in a language does not necessarily mean that the speaker is code-switching, because languages often borrow lexical items from other languages. If a word is borrowed, it becomes part of the lexicon of the borrowing language, whereas during code-switching the speaker is aware that the conversation involves multiple languages, and the switching is often intentional. Identifying whether a non-native word used by a bilingual speaker is due to borrowing or code-switching is not only of fundamental importance to theories of multilingualism, but is also an essential prerequisite for developing language and speech technologies for multilingual communities. In this paper, we present, for the first time, a series of computational methods to identify the likeliness of a word being borrowed or code-mixed, based on signals from social media. In particular, we use tweets from English-Hindi bilinguals from India to predict word borrowing. We first propose a method to sample a set of candidate words from the social media data using a context-based clustering approach. Next, we propose three novel, closely related metrics based on the usage of these words by users in different tweets; we then apply these metrics to score and rank the candidate words according to their likeliness of being borrowed. We compare these rankings with a ground truth ranking constructed through a human judgement experiment. The Spearman's rank correlation between the two rankings (∼0.62 for all three metric variants) is more than double the value (0.26) of the most competitive existing baseline reported in the literature.
Two other striking observations are: (i) the correlation is higher for the ground truth data elicited from the younger participants (age < 30) than for that from the older participants; since language change is brought about by the younger generation, this possibly indicates that social media is able to provide very early signals of borrowing; and (ii) the participants who use mixed language the least when tweeting provide the best signals of borrowing.
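As a rough illustration of the evaluation step described in the abstract, the sketch below computes Spearman's rank correlation between a metric-based scoring of candidate words and a human-judgement scoring. It is only a minimal sketch: the function names (`average_ranks`, `spearman`), the example words, and all scores are hypothetical placeholders, not the authors' code, metrics, or data.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores from highest (rank 1) downward; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one ranking is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical candidate words with invented scores (NOT data from the paper).
words = ["school", "train", "thanks", "weather", "decision"]
metric_scores = [0.91, 0.84, 0.40, 0.22, 0.35]  # assumed metric output
human_scores = [0.95, 0.70, 0.55, 0.10, 0.30]   # assumed human judgements

rho = spearman(metric_scores, human_scores)
print(f"Spearman's rho over {len(words)} words: {rho:.2f}")
```

Computing the coefficient as a Pearson correlation over average ranks, rather than with the shortcut formula based on squared rank differences, keeps the result well defined when several words receive the same score.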
Similar resources
All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media
In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman’s correlation values, our methods perform more than two times better (∼ 0.62) in predicting the borrowing likeliness compared to the best performing baseline (∼ 0.26) reported in literature. Based on this likeliness estimate w...
All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media
In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman correlation coefficient values, our methods perform more than two times better (nearly 0.62) in predicting the borrowing likeliness compared to the best performing baseline (nearly 0.26) reported in literature. Based on this l...
A Study on the Frequency of Occurrence and Usage of Anglicism in Speech of Young Iranian Telegram Users
This paper investigates the frequency of occurrence of English borrowed words in terms of three variables: age, gender, and educational status. To do so, a corpus comprising the extant files of participants in a target group on the Telegram social network was selected and analyzed. The quantitative study of the data shows that the occurrence of the loanwords is much more frequent in the speech ...
Using Machine Learning Algorithms for Automatic Cyber Bullying Detection in Arabic Social Media
Social media allows people to interact and express their thoughts or feelings about different subjects. However, some users may write offensive tweets to others via social media, which is known as cyber bullying. Successful prevention depends on automatically detecting malicious messages. Automatic detection of bullying in the text of social media by analyzing the text of tweets via one of the machine l...
Fuzzy Clustering Approach Using Data Fusion Theory and its Application To Automatic Isolated Word Recognition
In this paper, utilization of clustering algorithms for data fusion in decision level is proposed. The results of automatic isolated word recognition, which are derived from speech spectrograph and Linear Predictive Coding (LPC) analysis, are combined with each other by using fuzzy clustering algorithms, especially fuzzy k-means and fuzzy vector quantization. Experimental results show that the...
Journal title:
- CoRR
Volume: abs/1703.05122
Publication date: 2017